Defining the Gold Standard: MSE
To quantify how far our guess $T$ is from the reality $\psi(\theta)$, we define the Mean Squared Error (Definition 6.3.1):
$$MSE_\theta(T) = E_\theta((T - \psi(\theta))^2)$$
This is the average squared distance between our estimator and the target. A perfect estimator would have an MSE of zero, but in a world of random noise, we strive to minimize it.
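The expectation in Definition 6.3.1 can be approximated by simulation. Below is a minimal sketch (not from the text; the Gaussian model, sample size, and variable names are illustrative assumptions) that estimates the MSE of the sample mean by averaging squared errors over many repetitions:

```python
import random

random.seed(0)

theta = 5.0           # true target psi(theta): here, the population mean
n, reps = 20, 10_000  # sample size and Monte Carlo repetitions

# Approximate MSE_theta(T) = E_theta((T - psi(theta))^2), where T is the
# sample mean of n Gaussian observations with standard deviation 2.
sq_errors = []
for _ in range(reps):
    sample = [random.gauss(theta, 2.0) for _ in range(n)]
    t = sum(sample) / n
    sq_errors.append((t - theta) ** 2)

mse = sum(sq_errors) / reps
print(round(mse, 3))  # should be close to sigma^2 / n = 4 / 20 = 0.2
```

Since the sample mean is unbiased here, its MSE equals its variance, $\sigma^2/n$, which the simulated value approximates.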
Theorem 8.1.1: The Architecture of Error
Why does an estimator fail? Theorem 8.1.1 provides the blueprint. If $T$ has a finite second moment, the error relative to any constant $c$ is given by:
$$E_\theta((T - c)^2) = \mathrm{Var}_\theta(T) + (E_\theta(T) - c)^2$$
This formula reveals that the total squared error is minimized only when we choose $c = E_\theta(T)$. In the context of inference, we set $c = \psi(\theta)$, leading to the famous decomposition:
$$MSE_\theta(T) = \mathrm{Var}_\theta(T) + (E_\theta(T) - \psi(\theta))^2 = \text{Variance} + \text{Bias}^2$$
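The decomposition can be checked numerically. The sketch below (illustrative; the shrinkage factor and model are assumptions, not from the text) uses a deliberately biased estimator, $T = 0.9 \bar{X}$, and verifies that its empirical MSE equals empirical variance plus squared bias:

```python
import random

random.seed(1)

theta = 5.0
n, reps = 20, 50_000
shrink = 0.9  # T = 0.9 * sample mean: deliberately biased toward zero

estimates = []
for _ in range(reps):
    sample = [random.gauss(theta, 2.0) for _ in range(n)]
    estimates.append(shrink * sum(sample) / n)

mean_t = sum(estimates) / reps
mse = sum((t - theta) ** 2 for t in estimates) / reps
var = sum((t - mean_t) ** 2 for t in estimates) / reps
bias_sq = (mean_t - theta) ** 2

# The identity MSE = Variance + Bias^2 holds exactly for the
# empirical moments (up to floating-point rounding).
print(round(mse, 4), round(var + bias_sq, 4))
```

Note that the identity is exact for empirical moments, not merely approximate: the cross term vanishes because the estimates' deviations from their own mean sum to zero.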
The Precision-Accuracy Tradeoff
Imagine two weighing scales in a quality control lab:
- The Precise Relic: It gives the same weight every time (low Variance) but is miscalibrated by 2 grams (high Bias).
- The Erratic Sage: It is correct on average (zero Bias) but oscillates wildly between measurements (high Variance).
Theorem 8.1.1 allows us to calculate exactly which scale provides the lower total error. Often, we are willing to accept a small amount of systematic deviation (Bias) if it drastically reduces the noise (Variance).
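The two scales can be compared directly with the decomposition. In this sketch (the specific calibration error and noise levels are assumptions chosen for illustration), the Relic is off by 2 grams with 0.5 g of noise, while the Sage is unbiased with 3 g of noise:

```python
import random

random.seed(2)

true_weight = 100.0
reps = 100_000

# "Precise Relic": miscalibrated by +2 g, but very repeatable (sd = 0.5).
relic = [random.gauss(true_weight + 2.0, 0.5) for _ in range(reps)]
# "Erratic Sage": correct on average, but noisy (sd = 3).
sage = [random.gauss(true_weight, 3.0) for _ in range(reps)]

mse_relic = sum((x - true_weight) ** 2 for x in relic) / reps
mse_sage = sum((x - true_weight) ** 2 for x in sage) / reps

# Relic: MSE ~ 0.5^2 + 2^2 = 4.25.  Sage: MSE ~ 3^2 + 0 = 9.
print(round(mse_relic, 2), round(mse_sage, 2))
```

Here the biased Relic wins: its 4 grams² of squared bias is more than paid for by its tiny variance, exactly the tradeoff described above.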
Example 8.1.1: Sufficiency and Information
Optimality is tied to information. Consider a sample space $S = \{1, 2, 3, 4\}$. If outcomes 2, 3, and 4 are equally likely under every possible parameter value, they carry the same likelihood. We can then define a sufficient statistic $U$ that groups these outcomes together without losing any ability to make an optimal inference. As shown in the simulation, if $L(\cdot \mid 2) = L(\cdot \mid 3) = L(\cdot \mid 4)$, an optimal estimator treats these outcomes as a single informative event.
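A concrete toy model makes the grouping explicit. In this sketch (the parametrization $P_\theta(1) = \theta$ and $P_\theta(s) = (1-\theta)/3$ for $s \in \{2,3,4\}$ is an assumption chosen so that outcomes 2, 3, 4 share a likelihood; it is not specified in the text), the statistic $U$ collapses the three equally likely outcomes into one event:

```python
def likelihood(theta: float, s: int) -> float:
    # Assumed model: P_theta(1) = theta, P_theta(s) = (1 - theta)/3 otherwise,
    # so L(theta | 2) = L(theta | 3) = L(theta | 4) for every theta.
    return theta if s == 1 else (1.0 - theta) / 3.0

def U(s: int) -> int:
    # Sufficient statistic: groups outcomes 2, 3, 4 into a single event.
    return 1 if s == 1 else 2

# The likelihood depends on s only through U(s):
for theta in (0.1, 0.4, 0.7):
    assert likelihood(theta, 2) == likelihood(theta, 3) == likelihood(theta, 4)

print([U(s) for s in (1, 2, 3, 4)])  # -> [1, 2, 2, 2]
```

Because the likelihood function is the same for $s = 2, 3, 4$, any estimator that distinguishes among them gains nothing; an optimal estimator is a function of $U$ alone.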